Analysis of Boston data

Overview of data

The Boston data set is included in the R package MASS as an example data set.

The data set contains social, environmental and economic information about the greater Boston area. It includes the following variables:

  • crim = per capita crime rate by town
  • zn = proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus = proportion of non-retail business acres per town
  • chas = Charles River dummy variable
  • nox = nitrogen oxides concentration (parts per 10 million)
  • rm = average number of rooms per dwelling
  • age = proportion of owner-occupied units built prior to 1940
  • dis = weighted mean of distances to five Boston employment centres
  • rad = index of accessibility to radial highways
  • tax = full-value property-tax rate per $10000
  • ptratio = pupil-teacher ratio by town
  • black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • lstat = lower status of the population (percent)
  • medv = median value of owner-occupied homes in $1000s

Structure and the dimensions of the data

The data set has 506 observations of 14 variables. All variables are numeric (chas and rad are stored as integers).

'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
[1] 506  14
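
The output above can be reproduced with a few lines of R. This is a minimal sketch that assumes the MASS package is installed; `str()` prints the structure and `dim()` the dimensions.

```r
# Load the data; Boston ships with the MASS package
library(MASS)
data("Boston")

# Structure (variable types and first values) and dimensions
str(Boston)
dim(Boston)
```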

Summary, graphical presentation of data and correlations

As the pairs plot shows, most of the variables are not normally distributed: most are skewed and some are bimodal. Correlations between variables are easier to read from the correlation plot, where the largest circles in the upper-right triangle mark the strongest correlations (blue = positive, red = negative). The corresponding numeric values are mirrored in the lower-left triangle.

      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  
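
One way to produce the summary, the pairs plot and the correlation plot is sketched below. The use of `corrplot.mixed()` from the corrplot package is an assumption based on the description (circles in the upper triangle, numbers in the lower one). The call is guarded so the sketch still runs without that package. The name `cor_matrix` is illustrative.

```r
library(MASS)
data("Boston")

# Five-number summaries plus the mean for every variable
summary(Boston)

# Scatter-plot matrix of all variable pairs
pairs(Boston, pch = ".")

# Pearson correlation matrix, rounded for readability
cor_matrix <- round(cor(Boston), 2)
cor_matrix

# Optional: circles in the upper triangle, numbers in the lower triangle.
# Assumes the corrplot package is installed; skipped otherwise.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot.mixed(cor_matrix, lower = "number", upper = "circle")
}
```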


Standardization

Standardization centres every variable at zero: each column has its mean subtracted and is divided by its standard deviation, so all means are 0. This can be seen in the summary table below (compare with the original summary above).

The crime rate variable has been converted to a categorical variable with four levels: low, med_low, med_high and high. Each class contains one quartile (25%) of the data.

Train and test sets have been created by randomly splitting the standardized data into two groups: 80% of the observations go to the train set and 20% to the test set.

      crim                 zn               indus        
 Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
 1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
 Median :-0.390280   Median :-0.48724   Median :-0.2109  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
 Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
      chas              nox                rm               age         
 Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
 1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
 Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
 Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
      dis               rad               tax             ptratio       
 Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
 1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
 Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
 Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
     black             lstat              medv        
 Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
 1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
 Median : 0.3808   Median :-0.1811   Median :-0.1449  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
 Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865  
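
The three steps described in this section (standardize, bin the crime rate by quartiles, split 80/20) could look roughly like this. The object names and the seed are assumptions made for illustration; the original split was random, so a rerun will not reproduce the exact rows.

```r
library(MASS)
data("Boston")

# Centre each column to mean 0 and scale to standard deviation 1
boston_scaled <- as.data.frame(scale(Boston))

# Quartile break points of the scaled crime rate
bins <- quantile(boston_scaled$crim)

# Four-level factor with one quartile of the observations per level
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# Replace the numeric crime rate with the categorical one
boston_scaled$crim <- NULL
boston_scaled$crime <- crime

# Random 80/20 split; the seed is arbitrary and only makes the sketch repeatable
set.seed(123)
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
```

With 506 observations this gives 404 rows in `train` and 102 in `test`.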

Linear discriminant analysis (LDA)

Linear discriminant analysis (LDA) has been fitted on the train set only (80% of the data). The target variable is the new categorical crime rate (low, med_low, med_high, high). All other variables of the data set are used as predictors (see Overview of data).

The biplot below shows that the variable “rad” (index of accessibility to radial highways) has a far stronger influence on LD1 and LD2 than any other variable. In the biplot, the horizontal component of each vector is its contribution to LD1 (x-axis) and the vertical component its contribution to LD2 (y-axis). The sign of a coefficient of linear discriminants determines the direction of the vector, and the longer the vector, the bigger the influence. Most variables contribute to both LD1 and LD2, so their arrows point at various angles between the two axes. For example, in the LDA table below the most influential variable, “rad”, has coefficients LD1 = 3.32 and LD2 = 0.91, which can be read directly as the coordinates of the arrow head. Similarly, “nox”, the second most influential variable on LD2, has its arrow head at (0.40, -0.88). LD1 alone accounts for about 96% of the between-group variance (proportion of trace 0.9573); LD2 accounts for about 3% and LD3 for about 1%.

Call:
lda(crime ~ ., data = train)

Prior probabilities of groups:
      low   med_low  med_high      high 
0.2549505 0.2475248 0.2425743 0.2549505 

Group means:
                  zn      indus         chas        nox         rm
low       0.92104479 -0.9126143 -0.081207697 -0.8836562  0.4511518
med_low  -0.07454775 -0.2618785  0.003267949 -0.5995638 -0.1440819
med_high -0.38923530  0.1101206  0.209764839  0.3009757  0.2056717
high     -0.48724019  1.0170891 -0.081207697  1.0513605 -0.3686680
                age        dis        rad        tax     ptratio
low      -0.8674806  0.8740330 -0.6986570 -0.7404913 -0.45706193
med_low  -0.3748251  0.4351617 -0.5534930 -0.4961350 -0.05983227
med_high  0.3894962 -0.3501508 -0.4111534 -0.3380374 -0.23068070
high      0.8070771 -0.8623228  1.6384176  1.5142626  0.78111358
               black       lstat        medv
low       0.37969838 -0.77606084  0.53976567
med_low   0.36047995 -0.12968145 -0.01357018
med_high  0.09979493 -0.07031796  0.20324164
high     -0.75734598  0.89208473 -0.71601163

Coefficients of linear discriminants:
                LD1         LD2          LD3
zn       0.07792721  0.64288658 -0.902748272
indus    0.04959931 -0.20772648  0.617025867
chas    -0.11465325 -0.04849382  0.072183179
nox      0.40281708 -0.87907748 -1.267917206
rm      -0.12150992 -0.13807916 -0.176351216
age      0.19993306 -0.33397579 -0.174569825
dis     -0.04670761 -0.24812892  0.395086209
rad      3.31750902  0.90879821 -0.005502969
tax      0.08021468  0.08036632  0.335706010
ptratio  0.11111058 -0.02898242 -0.257307110
black   -0.12728145  0.04958585  0.163897982
lstat    0.24809721 -0.14089168  0.598299636
medv     0.21743004 -0.35199369 -0.003050695

Proportion of trace:
   LD1    LD2    LD3 
0.9573 0.0299 0.0128 
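
A sketch of how such a model and biplot can be built with `lda()` from MASS follows. The seed, the object names and the unscaled arrows are assumptions. Because the train set is a random sample, the coefficients from this sketch will differ from the output above.

```r
library(MASS)
data("Boston")

# Rebuild the standardized data with the four-level crime factor
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

# 80% training sample; arbitrary seed, so numbers differ from the output above
set.seed(123)
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]

# Fit LDA with crime as the target and all other variables as predictors
lda_fit <- lda(crime ~ ., data = train)

# Share of between-group variance per discriminant ("Proportion of trace")
prop_trace <- lda_fit$svd^2 / sum(lda_fit$svd^2)
round(prop_trace, 4)

# Observations in LD1/LD2 space, coloured by crime class
scores <- predict(lda_fit, newdata = train)$x
plot(scores[, 1:2], col = as.numeric(train$crime), pch = as.numeric(train$crime))

# One arrow per predictor; the head sits at (LD1, LD2) of its coefficients
heads <- lda_fit$scaling[, 1:2]
arrows(0, 0, heads[, 1], heads[, 2], length = 0.1, col = "red")
text(heads, labels = rownames(heads), pos = 3, cex = 0.8)
```

The proportion of trace is computed from the singular values stored in `lda_fit$svd`, which is the same quantity that `print()` reports for the fitted model.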

Predictive power of the model

In the test data set the categorical crime variable has been removed. The table below cross-tabulates the true classes of the test data against the classes predicted from the test data with crime removed. The total number of observations is 102 (506 minus the 404 in the train set). The diagonal (from the top-left corner) holds the correctly classified observations (sum = 77); all other cells are misclassifications (sum = 25). The prediction error is 25/102 ≈ 0.25.

          predicted
correct    low med_low med_high high Sum
  low       17       7        0    0  24
  med_low    3      14        9    0  26
  med_high   1       3       22    2  28
  high       0       0        0   24  24
  Sum       21      24       31   26 102
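
The error rate can be checked directly from the table. The sketch below copies the printed counts into a matrix by hand; in a full pipeline the same table would typically come from `table()` applied to the true classes and the `class` component of `predict()`. The variable names are illustrative.

```r
# Confusion matrix copied from the cross-tabulation above
# (rows = correct class, columns = predicted class)
classes <- c("low", "med_low", "med_high", "high")
confusion <- matrix(c(17,  7,  0,  0,
                       3, 14,  9,  0,
                       1,  3, 22,  2,
                       0,  0,  0, 24),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(correct = classes, predicted = classes))

n_total   <- sum(confusion)         # all test observations
n_correct <- sum(diag(confusion))   # correctly classified
n_wrong   <- n_total - n_correct    # misclassified

error_rate <- n_wrong / n_total
c(total = n_total, correct = n_correct, wrong = n_wrong)
round(error_rate, 3)

# Share of each true class that was predicted correctly
round(diag(confusion) / rowSums(confusion), 2)
```

The per-class shares show that the high class is recovered completely, while most errors fall between neighbouring classes.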

Calculation of distances between the observations and optimal number of clusters

A Euclidean distance matrix has been calculated for the data; a summary of the distances is shown below. The optimal number of clusters can be investigated by running the k-means algorithm for a range of cluster counts. The point where the TWSS (total within-cluster sum of squares) drops sharply indicates the optimal number of clusters. In this case the optimal number is 2 or 3. In the first plot the data is classified into two clusters and in the second plot into three.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1343  3.4625  4.8241  4.9111  6.1863 14.3970 
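
A minimal sketch of the distance calculation and the TWSS search, assuming the standardized data is used and k runs from 1 to 10. The range, the seed, `nstart` and the variables shown in the final plot are illustrative choices.

```r
library(MASS)
data("Boston")

boston_scaled <- as.data.frame(scale(Boston))

# Euclidean distances between all pairs of observations
dist_eu <- dist(boston_scaled, method = "euclidean")
summary(dist_eu)

# Total within-cluster sum of squares for k = 1..10
set.seed(123)  # arbitrary seed; k-means starts from random centres
k_max <- 10
twss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled, centers = k, nstart = 10)$tot.withinss
})

# Look for the k where the curve bends
plot(1:k_max, twss, type = "b", xlab = "Number of clusters", ylab = "TWSS")

# Cluster with a chosen k and plot a few variables coloured by cluster
km <- kmeans(boston_scaled, centers = 2, nstart = 10)
pairs(boston_scaled[, c("crim", "nox", "rad", "tax", "medv")], col = km$cluster)
```

For k = 1 the TWSS equals the total sum of squares of the standardized data, 14 × 505 = 7070, which gives a reference point for judging how sharply the curve drops.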

Bonus

Here LDA is fitted with the k-means clusters as target classes, for 3 to 6 clusters. All other variables in the Boston data are predictors. The LDA tables and biplots show how the results change with the number of clusters: which variables matter most depends on the number of clusters, as seen in the plots.

Call:
lda(clu3 ~ ., data = boston_scaled_new)

Prior probabilities of groups:
        1         2         3 
0.3083004 0.4130435 0.2786561 

Group means:
        crim         zn      indus        nox         rm         age
1  0.8647905 -0.4872402  1.0949412  1.1524404 -0.4717159  0.78718751
2 -0.3763256 -0.3947158 -0.1595373 -0.2992872 -0.2772450  0.02424663
3 -0.3989734  1.1241495 -0.9749470 -0.8314162  0.9328502 -0.90687091
         dis        rad        tax     ptratio      black       lstat
1 -0.8532750  1.3172714  1.3530841  0.57186480 -0.6936034  0.88830984
2  0.0304962 -0.5856776 -0.5482746  0.09789185  0.2769803 -0.03824517
3  0.8988453 -0.5892747 -0.6843385 -0.77780359  0.3568316 -0.92612124
        medv
1 -0.6893319
2 -0.1525634
3  0.9888052

Coefficients of linear discriminants:
                LD1        LD2
crim     0.03403154  0.1907082
zn       0.04097709  0.8724398
indus   -0.41998138 -0.1847419
nox     -1.06730082  0.6654667
rm       0.26776386  0.5104809
age      0.21123453 -0.4670361
dis     -0.07658281  0.3338914
rad     -1.17679746  0.3612912
tax     -0.96644205  0.5042929
ptratio -0.06864005 -0.1601563
black    0.07085932 -0.0359700
lstat   -0.26379244  0.3189355
medv     0.01076674  0.6395889

Proportion of trace:
   LD1    LD2 
0.8729 0.1271 

Call:
lda(clu4 ~ ., data = boston_scaled_new)

Prior probabilities of groups:
        1         2         3         4 
0.4011858 0.1660079 0.3221344 0.1106719 

Group means:
        crim         zn      indus        nox         rm         age
1 -0.3793592 -0.3541732 -0.2369409 -0.3788647 -0.3076659 -0.09191194
2 -0.4124621  1.9031602 -1.0764395 -1.1428600  0.6095020 -1.39546865
3  0.8082769 -0.4872402  1.1165562  1.1413403 -0.4676591  0.79696608
4 -0.3587926 -0.1526454 -0.7764063 -0.2344411  1.5622579  0.10664320
         dis        rad        tax     ptratio      black       lstat
1  0.1379588 -0.5920713 -0.5838167  0.08129234  0.2784401 -0.06159606
2  1.5205894 -0.6250261 -0.5943244 -0.67561813  0.3537454 -0.90687350
3 -0.8539968  1.2199444  1.2927317  0.58616084 -0.6486732  0.87910380
4 -0.2952439 -0.4671119 -0.7549503 -0.98740427  0.3481389 -0.97522401
        medv
1 -0.1690186
2  0.6658984
3 -0.7034406
4  1.6613592

Coefficients of linear discriminants:
                LD1          LD2         LD3
crim    -0.03692074 -0.063843487  0.17437690
zn      -0.10469860 -1.680125461 -0.02034829
indus    0.62305615 -0.375784264 -0.51749840
nox      1.07818726 -0.501408664  0.49444666
rm      -0.13015732 -0.062534323  0.64219861
age     -0.18778510  0.593603356  0.12209172
dis      0.01495296 -0.528845086 -0.16022047
rad      0.71521529  0.091708443  0.26711660
tax      0.86706461 -0.832332218  0.24904935
ptratio  0.21355380 -0.110643988 -0.18716017
black   -0.01780187  0.008088268 -0.03064577
lstat    0.23409949 -0.104352523  0.29032605
medv    -0.14661672  0.063424182  0.97435297

Proportion of trace:
   LD1    LD2    LD3 
0.7143 0.2084 0.0773 

Call:
lda(clu5 ~ ., data = boston_scaled_new)

Prior probabilities of groups:
         1          2          3          4          5 
0.10671937 0.06916996 0.39328063 0.23913043 0.19169960 

Group means:
        crim         zn      indus        nox         rm        age
1 -0.2753323 -0.4872402  1.5337294  1.1273809 -0.6003284  0.9334996
2  1.4802645 -0.4872402  1.0149946  0.9676887 -0.2969389  0.7656016
3 -0.3884901 -0.3308141 -0.4873088 -0.4761310 -0.2318056 -0.1989968
4 -0.3981339  1.2930469 -0.9902994 -0.8283387  1.0566896 -0.9121115
5  0.9128084 -0.4872402  1.0149946  1.0333132 -0.4012324  0.7501117
         dis        rad         tax    ptratio      black      lstat
1 -0.8995039 -0.6096828  0.01485481 -0.3917541 -0.1262348  0.6474932
2 -0.8580043  1.6596029  1.52941294  0.8057784 -3.2970564  1.1699052
3  0.2356027 -0.5732709 -0.60914944  0.1061891  0.3164212 -0.1478389
4  0.9121798 -0.5955687 -0.67325561 -0.8788401  0.3554817 -0.9544043
5 -0.8108798  1.6596029  1.52941294  0.8057784  0.1673459  0.7112530
        medv
1 -0.4618433
2 -1.0473739
3 -0.1012054
4  1.0995470
5 -0.5289453

Coefficients of linear discriminants:
                LD1         LD2         LD3         LD4
crim     0.11743258  0.04770450 -0.14300153 -0.16024852
zn       0.34092540  0.12344732 -0.08107374 -1.23725268
indus    0.53753143 -1.85979858  0.92454526 -1.10944073
nox     -0.00906438 -0.29233935  0.67637180 -0.69383481
rm      -0.05331958  0.27825890 -0.27651832 -0.52558718
age     -0.02982587 -0.24810645  0.11616057  0.17202368
dis     -0.18913313 -0.03732336  0.16315838 -0.26855975
rad      5.88697673  2.04722673  0.02514581  0.50932714
tax      0.13814687  0.14410388  0.09950265 -0.25979580
ptratio  0.20734779  0.06893525 -0.07716785  0.22476182
black   -0.33376654  1.18823488  2.14340012 -0.03614042
lstat   -0.01997449  0.13151481 -0.19758426 -0.25835224
medv    -0.11300541  0.24641561 -0.16457165 -0.62281824

Proportion of trace:
   LD1    LD2    LD3    LD4 
0.8192 0.0840 0.0681 0.0287 

Call:
lda(clu6 ~ ., data = boston_scaled_new)

Prior probabilities of groups:
         1          2          3          4          5          6 
0.30237154 0.12845850 0.21146245 0.09683794 0.05731225 0.20355731 

Group means:
        crim         zn      indus        nox         rm        age
1 -0.3975627 -0.1637000 -0.5873353 -0.6615915 -0.1682976 -0.6574663
2 -0.4140702  2.2813035 -1.1562550 -1.1768217  0.7293996 -1.4086475
3 -0.3231486 -0.4822312  0.6308150  0.5041272 -0.5223445  0.7753457
4 -0.3680230 -0.1494734 -0.7440331 -0.2107180  1.7049351  0.1966439
5  3.0022987 -0.4872402  1.0149946  1.0593345 -1.3064650  0.9805356
6  0.5173303 -0.4872402  1.0149946  1.0036872 -0.1109215  0.6904986
         dis        rad        tax     ptratio       black      lstat
1  0.5448232 -0.5720261 -0.6911565 -0.06248294  0.36026575 -0.3806108
2  1.5645584 -0.6656010 -0.5702843 -0.80946918  0.35416061 -0.9741045
3 -0.5711540 -0.5943977 -0.1915536  0.08745080  0.03673407  0.5492158
4 -0.3113322 -0.5037339 -0.7871603 -1.09274698  0.34883211 -0.9623279
5 -1.0484716  1.6596029  1.5294129  0.80577843 -1.19066142  1.8708759
6 -0.7599982  1.6596029  1.5294129  0.80577843 -0.62752658  0.5406100
         medv
1  0.01419929
2  0.81491803
3 -0.48624639
4  1.73167306
5 -1.31020021
6 -0.48514538

Coefficients of linear discriminants:
                LD1         LD2          LD3         LD4         LD5
crim     0.26905214 -0.04038901  0.952783084  1.05716987  0.83157632
zn       0.02554270 -1.56332967  0.861975978 -1.07176445  0.45885648
indus    0.26308114  0.34083276  0.299613031 -0.64713459 -0.27149224
nox     -0.03462213  0.07174078 -0.011080924 -0.44577196  0.17727824
rm      -0.09400180 -0.18081151 -0.748312466 -0.28307867  0.35366014
age      0.09743790  0.59122392 -0.124953000 -0.52102866  0.63971252
dis     -0.28508543 -0.26165892 -0.004961456  0.09291479 -0.31814880
rad      5.99463004 -1.10320047 -1.528777174  0.82447556 -1.05043782
tax      0.09961052 -0.43262170  0.468017911 -0.81595269  0.38580292
ptratio  0.25749486  0.08897531  0.282642121 -0.34518855 -0.02514549
black   -0.04609198 -0.01324538 -0.083403025  0.01031609 -0.10644781
lstat    0.09758566  0.09732902  0.327335484  0.20747689  0.64677037
medv    -0.06198719 -0.18784402 -0.206662871  0.28964586  0.98112363

Proportion of trace:
   LD1    LD2    LD3    LD4    LD5 
0.8529 0.0830 0.0298 0.0185 0.0159 
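
The four models above could be produced along these lines. This is a sketch under assumptions: the helper `lda_on_clusters()` is hypothetical, the seed is arbitrary, and chas is dropped only because it does not appear in the printed group means. Cluster numbering and sizes depend on the random start, so results will not match the output exactly.

```r
library(MASS)
data("Boston")

# Standardize; chas is left out because it is absent from the output above
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$chas <- NULL

set.seed(123)  # arbitrary seed; cluster labels and sizes depend on it

# Hypothetical helper: run k-means, then LDA with the cluster label as target
lda_on_clusters <- function(data, k) {
  km <- kmeans(data, centers = k, nstart = 10)
  d <- data
  d$cluster <- factor(km$cluster)
  lda(cluster ~ ., data = d)
}

fits <- lapply(3:6, function(k) lda_on_clusters(boston_scaled, k))
names(fits) <- paste0("clu", 3:6)

# Proportion of trace for each number of clusters
lapply(fits, function(f) round(f$svd^2 / sum(f$svd^2), 4))

# Predictors ranked by absolute LD1 coefficient for the 3-cluster model
sort(abs(fits$clu3$scaling[, "LD1"]), decreasing = TRUE)
```

With k clusters, LDA yields k - 1 discriminants, which is why the tables above grow from two to five LD columns.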

Super-Bonus

The data points are in the same positions in both plots; only the colouring differs. In the main group, the grouping differs slightly depending on whether colours are coded by crime or by cluster. In the separate group, the high-crime class is well isolated, whereas the clustering splits that group in two. Colouring by crime gathers the high-crime observations more clearly into one group.

Train data classified by Crime (1 = low, 4 = high)
Train data classified by Clusters